This article outlines an analysis project workflow that I've found efficient for making reproducible research reports using R with Rmarkdown and now Quarto. I start by covering the creation of annotated analysis files and running descriptive statistics on them, with the goals of understanding the data and their quality and completeness. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement and to produce tabular and graphical statistical summaries. Several examples of processing and manipulating data using the data.table package are given. Finally, examples of caching results and doing parallel processing are presented.
I have a directory for each major project, and put everything in that directory (including data files) except for graphics files for figures, which are placed in their own subdirectory underneath the project folder. The directory name for the project includes key identifying information, and files within that directory do not contain project information in their names, nor do they contain dates, unless I want to freeze an old version of an analysis script.
With Quarto I specify that html files are to be self-contained, so there are no separate graphics files.
For multi-analyst projects or ones in which you want to capture the entire code history, having the project on github is worthwhile.
Analysis File Creation
I typically create a compact analysis file in a separate R script called create.r and have it produce a compressed R binary .rds file containing a data.frame or data.table, using saveRDS(name, 'name.rds', compress='xz'). Then I have an analysis script named for example a.qmd (for Quarto reports) or a.Rmd (for RMarkdown reports) that starts with d <- readRDS('name.rds').
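A minimal sketch of this pattern (file and variable names are hypothetical):

```r
# create.r: import the raw data, annotate it, and save a compressed binary copy
require(Hmisc)
d <- csv.get('rawdata.csv')                    # hypothetical raw data file
saveRDS(d, 'analysisfile.rds', compress='xz')

# a.qmd (or a.Rmd) then starts with:
d <- readRDS('analysisfile.rds')
```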
Templates for analysis reports are [here](http://hbiostat.org/rr/index.html#template), and a comprehensive report example may be found [here](https://hbiostat.org/R/Hmisc/markov).
When variables need to be recoded, have labels added or changed, or have units of measurement added, I specify those using the Hmisc package upData function.
Variable labels and units of measurement are used in special ways in my R packages. This will show up in the describe and contents function outputs below and in axis labels for graphics.
To facilitate some operations requiring variable names to be quoted, define a function .q to quote them automatically. .q is like the Hmisc function Cs but also allows elements to be named. It will be in Hmisc 4.7-1.
```r
.q <- function(...) {
  s <- sys.call()[-1]
  w <- as.character(s)
  n <- names(s)
  if(length(n)) names(w) <- n
  w
}
.q(a, b, c, 'this and that')
```

```
[1] "a" "b" "c"             "this and that"
```

```r
.q(dog=a, giraffe=b, cat=c)
```

```
    dog giraffe     cat
    "a"     "b"     "c"
```
Here is an upData example:
```r
# Function to recode from atypical coding for yes/no in raw data
yn <- function(x) factor(x, 0:1, c('yes', 'no'))
d <- upData(d,
  rename    = .q(gender=sex, any.event=anyEvent),
  posSE     = yn(posSE),
  newMI     = yn(newMI),
  newPTCA   = yn(newPTCA),
  newCABG   = yn(newCABG),
  death     = yn(death),
  hxofHT    = yn(hxofHT),
  hxofDM    = yn(hxofDM),
  hxofCig   = factor(hxofCig, c(0, 0.5, 1),
                     c('heavy', 'moderate', 'non-smoker')),
  hxofMI    = yn(hxofMI),
  hxofPTCA  = yn(hxofPTCA),
  hxofCABG  = yn(hxofCABG),
  chestpain = yn(chestpain),
  anyEvent  = yn(anyEvent),
  drop = .q(event.no, phat, mics, deltaEF, newpkmphr, gdpkmphr,
            gdmaxmphr, gddpeakdp, gdmaxdp, hardness),
  labels = c(bhr      = 'Basal heart rate',
             basebp   = 'Basal blood pressure',
             basedp   = 'Basal Double Product bhr*basebp',
             age      = 'Age',
             pkhr     = 'Peak heart rate',
             sbp      = 'Systolic blood pressure',
             dp       = 'Double product pkhr*sbp',
             dose     = 'Dose of dobutamine given',
             maxhr    = 'Maximum heart rate',
             pctMphr  = 'Percent maximum predicted heart rate achieved',
             mbp      = 'Maximum blood pressure',
             dpmaxdo  = 'Double product on max dobutamine dose',
             dobdose  = 'Dobutamine dose at max double product',
             baseEF   = 'Baseline cardiac ejection fraction',
             dobEF    = 'Ejection fraction on dobutamine',
             chestpain= 'Chest pain',
             ecg      = 'Baseline electrocardiogram diagnosis',
             restwma  = 'Resting wall motion abnormality on echocardiogram',
             posSE    = 'Positive stress echocardiogram',
             newMI    = 'New myocardial infarction',
             newPTCA  = 'Recent angioplasty',
             newCABG  = 'Recent bypass surgery',
             hxofHT   = 'History of hypertension',
             hxofDM   = 'History of diabetes',
             hxofMI   = 'History of myocardial infarction',
             hxofCig  = 'History of smoking',
             hxofPTCA = 'History of angioplasty',
             hxofCABG = 'History of coronary artery bypass surgery',
             anyEvent = 'Death, newMI, newPTCA, or newCABG'),
  units = .q(age=years, bhr=bpm, basebp=mmHg, basedp='bpm*mmHg',
             pkhr=mmHg, sbp=mmHg, dp='bpm*mmHg', maxhr=bpm,
             mbp=mmHg, dpmaxdo='bpm*mmHg', baseEF='%', dobEF='%',
             pctMphr='%', dose=mg, dobdose=mg) )
saveRDS(d, 'stressEcho.rds', compress='xz')
# Note that we could have automatically recoded all 0:1 variables
# if they were all to be treated identically:
for(x in names(d)) if(all(d[[x]] %in% c(0,1))) d[[x]] <- yn(d[[x]])
```
For REDCap Users
If reading data exported from REDCap and placed into the project directory, I run the following to get rid of duplicate variables (the factor and non-factor versions of variables that REDCap produces) and to automatically convert dates to Date variables:
require(Hmisc)
getRs('importREDCap.r', put='source') # source() code to define function
mydata <- importREDCap() # by default operates on last downloaded export
saveRDS(mydata, 'mydata.rds', compress='xz')
When file names are not given to importREDCap, the function looks for the most recently created .csv file and .R file with the same prefix and uses those. See [this](http://hbiostat.org/bbr/md/r.html) for more information.
Variable Naming
I prefer short but descriptive variable names. As exemplified above, I use variable labels and units to provide more information. For example, I wouldn't name the age variable age.at.enrollment.yrs but would name it age with a label of Age at Enrollment and with units of years. Short, clear names unclutter code, especially formulas in statistical models. One can always fetch a variable label while writing a program (e.g., typing label(d$age) at the console) to check that you have the right variable (or put the data dictionary in a window for easy reference, as shown below).
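A minimal sketch, assuming a data frame d containing an age variable:

```r
# Short name plus annotation; the label and units are picked up by describe,
# contents, summaryM, and graphics axis labels
d <- upData(d,
            labels = c(age = 'Age at Enrollment'),
            units  = c(age = 'years'))
label(d$age)   # "Age at Enrollment"
units(d$age)   # "years"
```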
Hmisc package graphics and table making functions such as summaryM and summary.formula specially typeset units in a smaller font.
Data Dictionary
The Hmisc package contents function will provide a concise data dictionary. Here is an example using the permanent version of the dataset created above, which can be accessed with the Hmisc getHdata function. The top of the contents output has the number of levels for factor variables hyperlinked. Click on the number to go directly to the list of levels for that variable.
The permanent version of stressEcho coded binary variables as 0/1 instead of N/Y.
```r
require(Hmisc)
getHdata(stressEcho)
d <- stressEcho
```
Data Dictionary
html(contents(d), levelType='table')
Data frame: d

558 observations and 31 variables, maximum # NAs: 0
You can write the text output of contents into a text file in your current working directory, and click on that file in the RStudio Files window to create a new tab in the editor panel where you can view the data dictionary at any time. This is especially helpful if you need a reminder of variable definitions that are stored in the variable labels. Here is an example where the formatted data dictionary is saved.
Users having the xless system command installed can pop up a contents window at any time by typing xless(contents(d)) in the console. xless is in Hmisc.
capture.output(contents(d), file='contents.txt')
Or put the html version of the data dictionary into a small browser window to which you can refer at any point in analysis coding.
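For example:

```r
cat(html(contents(d)), file='contents.html')
browseURL('contents.html', browser='vivaldi -new-window')
```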
RStudio provides a nice way to do this, facilitated by the htmlView function, which may be fetched from Github using the Hmisc getRs function. htmlView takes any number of objects for which an html method exists to render them. They are rendered in the RStudio Viewer pane. If you are running outside RStudio, your default browser will be launched instead.
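To fetch and use it:

```r
getRs('htmlView.r', put='source')
htmlView(contents(d))
```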
Occasionally RStudio Viewer will drop its arrow button making it impossible to navigate back and forth to different html outputs.
In some cases it is best to have a browser window open to the full descriptive statistics for a data table/frame (see below; the describe function also shows labels, units, and levels).
For either approach it would be easy to have multiple tabs open, one tab for each of a series of data tables, or use htmlView.
Descriptive Statistics
The Hmisc describe function is my main tool for getting initial descriptive statistics and quality controlling the data in a univariate fashion. Here is an example. The Info index is a measure of the information content in a numeric variable relative to the information in a continuous numeric variable with no ties. A very low value of Info will occur when a highly imbalanced variable is binary. Clicking on Glossary on the right will pop up a browser window with a more in-depth glossary of terms used in Hmisc package output. It links to hbiostat.org/R/glossary.html, which you can link from your own reports that use Hmisc.
Info comes from the approximate formula for the variance of a log odds ratio for a proportional odds model/Wilcoxon test, due to Whitehead. See the Glossary at hbiostat.org/R/glossary.html.
Gmd in the output stands for Gini’s mean difference—the mean absolute difference over all possible pairs of different observations. It is a very interpretable measure of dispersion that is more robust than the standard deviation.
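describe Output

The output below was produced by:

```r
w <- describe(d)
html(w)
```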
| Variable | n | missing | distinct | Info | Sum | Mean | Gmd |
|:---|---:|---:|---:|---:|---:|---:|---:|
| restwma: Resting wall motion abnormality on echocardiogram | 558 | 0 | 2 | 0.745 | 257 | 0.4606 | 0.4978 |
| posSE: Positive stress echocardiogram | 558 | 0 | 2 | 0.553 | 136 | 0.2437 | 0.3693 |
| newMI: New myocardial infarction | 558 | 0 | 2 | 0.143 | 28 | 0.05018 | 0.09549 |
| newPTCA: Recent angioplasty | 558 | 0 | 2 | 0.138 | 27 | 0.04839 | 0.09226 |
| newCABG: Recent bypass surgery | 558 | 0 | 2 | 0.167 | 33 | 0.05914 | 0.1115 |
| death | 558 | 0 | 2 | 0.123 | 24 | 0.04301 | 0.08247 |
| hxofHT: History of hypertension | 558 | 0 | 2 | 0.625 | 393 | 0.7043 | 0.4173 |
| hxofDM: History of diabetes | 558 | 0 | 2 | 0.699 | 206 | 0.3692 | 0.4666 |
| hxofMI: History of myocardial infarction | 558 | 0 | 2 | 0.599 | 154 | 0.276 | 0.4004 |
| hxofPTCA: History of angioplasty | 558 | 0 | 2 | 0.204 | 41 | 0.07348 | 0.1364 |
| hxofCABG: History of coronary artery bypass surgery | 558 | 0 | 2 | 0.399 | 88 | 0.1577 | 0.2661 |
| any.event: Death, newMI, newPTCA, or newCABG | 558 | 0 | 2 | 0.402 | 89 | 0.1595 | 0.2686 |

hxofCig: History of smoking (n=558, missing=0, distinct=3)

| Value | heavy | moderate | non-smoker |
|:---|---:|---:|---:|
| Frequency | 122 | 138 | 298 |
| Proportion | 0.219 | 0.247 | 0.534 |

ecg: Baseline electrocardiogram diagnosis (n=558, missing=0, distinct=3)

| Value | normal | equivocal | MI |
|:---|---:|---:|---:|
| Frequency | 311 | 176 | 71 |
| Proportion | 0.557 | 0.315 | 0.127 |
```r
# To create a separate browser window:
cat(html(w), file='desc.html')
browseURL('desc.html', browser='firefox -new-window')
```
Better, whether using RStudio or not:
htmlView(w, contents(d)) # or htmlView(describe(d1), describe(d2), ...)
There is also a plot method for describe output. It produces two graphics objects: one for categorical variables and one for continuous variables. The default is to use ggplot2 to produce static graphics.
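For example, with the two components placed in separate tabs in the report:

```r
p <- plot(w, bvspace=2.5)
p$Categorical
p$Continuous
```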
To make creation of tabs in Quarto easier, here is a function that loops through the elements of a named list and outputs each element into a separate Quarto tab. A wide argument is used to expand the width of the output outside the usual margins. An initblank argument creates a first tab that is empty. This allows one to show nothing until one of the other tabs is clicked.
This function is in Github and can be defined using the Hmisc getRs function, i.e., getRs('maketabs.r', put='source').
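Here is the function; it writes Quarto panel-tabset markup and passes it through knitr for rendering:

````r
maketabs <- function(x, labels=names(x), wide=FALSE, initblank=FALSE) {
  yaml <- paste0('.panel-tabset', if(wide) ' .column-page')
  k <- c('', '::: {', yaml, '}', '')
  if(initblank) k <- c(k, '', '## ', '')
  .objmaketabs. <<- x
  for(i in 1 : length(x)) {
    cname <- paste0('c', round(100000 * runif(1)))
    k <- c(k, '', paste('##', labels[i]), '',
           paste0('```{r ', cname, ',results="asis",echo=FALSE}'),
           paste0('.objmaketabs.[[', i, ']]'),
           '```', '')
  }
  k <- c(k, ':::', '')
  cat(knitr::knit(text=knitr::knit_expand(text=k), quiet=TRUE))
}
maketabs(p)
````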
By specifying the grType option you can instead get plotly graphics that use hover text to show more information, especially when hovering over the leftmost dot or tick mark for a variable.
See [this](http://hbiostat.org/R/Hmisc/examples.html) for other Hmisc functions for descriptive graphics and tables, especially for stratified descriptive statistics for categorical variables. The summaryM function prints a tabular summary of a mix of continuous and categorical variables. Here is an example where stratification is by history of myocardial infarction (MI).
The initblank option to the maketabs function creates the initially opened tab, which has no content, thereby leaving it to the user to click one of the other tabs before output is shown.
```r
require(data.table)
setDT(d)   # turn d into a data table
w <- d
w[, hxofMI := factor(hxofMI, 0:1, c('No history of MI', 'History of MI'))]
vars <- setdiff(names(d), 'hxofMI')
form <- as.formula(paste(paste(vars, collapse='+'), '~ hxofMI'))
print(form)
```
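The formula is then passed to summaryM, and the resulting table and plots are placed in tabs:

```r
s <- summaryM(form, data=d, test=TRUE)
w <- list('Table 1' = html(s, exclude1=TRUE, npct='both', digits=3,
                           middle.bold=TRUE),
          'Categorical Variable Plot' = plot(s, which='categorical', vars=1:4,
                                             height=600, width=1000),
          'Continuous Variable Plot'  = plot(s, which='continuous', vars=1:4))
# initblank=TRUE: put an empty tab first so that nothing initially displays
maketabs(w, wide=TRUE, initblank=TRUE)
```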
a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. Tests used: ¹Wilcoxon test; ²Pearson test.
Semi-interactive stratified spike histograms are also useful descriptive plots. These plots also contain a superset of the quantiles used in box plots, and the legend is clickable, allowing any of the statistical summaries to be turned off.
d[, histboxp(x=maxhr, group=ecg, bins=200)]
Data Manipulation and Aggregation
The [data.table](https://hbiostat.org/R/data.table) package provides a concise, consistent syntax for managing simple and complex data manipulation tasks, and it is extremely efficient for large datasets. One of the best organized tutorials is [this one](https://riptutorial.com/data-table), and a cheatsheet for data transformation is [here](https://raw.githubusercontent.com/rstudio/cheatsheets/master/datatable.pdf). A master cheatsheet for data.table is [here](https://rdrr.io/cran/data.table/man/data.table.html), from which the general syntax below is taken, where DT represents a data table.
```
DT[ i,  j,  by ]   # + extra arguments
    |   |   |
    |   |    ------> grouped by what?
    |    ----------> what to do?
     --------------> on which rows?
```
Data tables are also data frames, so they work with any method that handles data frames. But data tables do not contain rownames.
Several data.table examples follow. I like to hold the current dataset in d to save typing.
Analyzing Selected Variables and Subsets
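describe can be run directly inside data.table expressions to analyze selected variables and subsets; part of the resulting output for gender is shown below.

```r
d[, html(describe(age))]
d[, html(describe(~ age + gender))]
d[gender == 'female', html(describe(age))]   # analyze age for females
html(describe(d[, .(age, gender)], 'Age and Gender Stats'))
```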
Value male female
Frequency 220 338
Proportion 0.394 0.606
```r
# Separate analysis by female, male.  Use keyby instead of by to sort the usual way.
d[, print(describe(age, descript=gender)), by=gender]
```
male : Age [years]
n missing distinct Info Mean Gmd .05 .10
220 0 51 0.999 67.86 12.91 45.95 51.00
.25 .50 .75 .90 .95
62.00 69.00 75.00 81.00 86.00
lowest : 30 34 38 40 43, highest: 88 89 91 92 93
female : Age [years]
n missing distinct Info Mean Gmd .05 .10
338 0 58 0.999 67.01 13.74 47.00 50.70
.25 .50 .75 .90 .95
59.25 68.00 76.00 83.00 85.00
lowest : 26 28 29 33 34, highest: 87 88 89 90 91
Empty data.table (0 rows and 1 cols): gender
```r
# Compute mean and median age by gender
d[, .(Mean=mean(age), Median=median(age)), by=gender]
```
gender Mean Median
1: male 67.85909 69
2: female 67.00888 68
```r
# To create a new subset
w <- d[gender == 'female' & age < 70, ]
```
Adding or Changing Variables
With data.table you can create a new data table with added variables, or you can add or redefine variables in a data table in place. The latter has major speed and memory efficiency implications when processing massive data tables. Here d refers to a different data table from the one used above.
```r
# Rename a variable
setnames(d, c('gender', 'height'),
            c('sex',    'ht'))
# Easier:
setnames(d, .q(gender, height),
            .q(sex,    ht))
# Or
ren <- .q(gender=sex, height=ht)
setnames(d, names(ren), ren)
# Or
rename <- function(x, n) setnames(x, names(n), n)
rename(d, .q(gender=sex, height=ht))
# For all variables having baseline_ at the start of their names, remove it
n <- names(d)
setnames(d, n, sub('^baseline_', '', n))   # ^ = at start; use $ for at end
# For all variables having baseline_ at the start of their names, remove it
# and add Baseline to start of variable label
n   <- names(d)
n   <- n[grepl('^baseline_', n)]
ren <- sub('^baseline_', '', n); names(ren) <- n
# Fetch vector of labels for variables whose names start with baseline_
labs <- sapply(d, label)[n]   # label() result is "" if no label
labs <- paste('Baseline', labs)
d <- upData(d, rename=ren, labels=labs)
# Change all names to lower case
n <- names(d)
setnames(d, n, tolower(n))
# Abbreviate names of all variables longer than 10 characters
# abbreviate() will ensure that all names are unique
n <- names(d)
setnames(d, n, abbreviate(n, minlength=10))
# For any variables having [...] or (...) in their labels, assume these
# are units of measurement and move them from the label to the units attribute
d <- upData(d, moveUnits=TRUE)
# Add two new variables, storing result in a new data table
z <- d[, .(bmi=wt / ht ^ 2, bsa=0.016667 * sqrt(wt * ht))]
# Add one new variable in place
d[, bmi := wt / ht ^ 2]
# Add two new variables in place
d[, `:=`(bmi = wt / ht ^ 2, bsa = 0.016667 * sqrt(wt * ht))]
d[, .q(bmi, bsa) := .(wt / ht ^ 2, 0.016667 * sqrt(wt * ht))]
# Compute something requiring a different formula for different types
# of observations
d[, htAdj := ifelse(sex == 'male', ht, ht * 1.07)]
d[, htAdj := ht * ifelse(sex == 'male', 1, 1.07)]
d[, htAdj := (sex == 'male') * ht + (sex == 'female') * ht * 1.07]
d[sex == 'male',   htAdj := ht]
d[sex == 'female', htAdj := ht * 1.07]
# Add label & optional units (better to use upData which works on data tables)
adlab <- function(x, lab, un='') {
  label(x) <- lab
  if(un != '') units(x) <- un
  x
}
d[, maxhr := adlab(maxhr, 'Maximum Heart Rate', '/m')]
# Delete a variable (or use upData)
d[, bsa := NULL]
# Delete two variables
d[, `:=`(bsa=NULL, bmi=NULL)]
d[, .q(bsa, bmi) := NULL]
```
Recoding Variables
```r
# Group levels a1, a2 as a & b1, b2, b3 as b
d[, x := factor(x, .q(a1, a2, b1, b2, b3),
                   .q(a,  a,  b,  b,  b))]
# Regroup an existing factor variable
levels(d$x) <- list(a=.q(a1, a2), b=.q(b1, b2, b3))
# or
d <- upData(d, levels=list(x=list(a=.q(a1, a2), b=.q(b1, b2, b3))))
# Or manipulate character strings
d[, x := substring(x, 1, 1)]   # replace x with first character of levels
# or
levels(d$x) <- substring(levels(d$x), 1, 1)
# Recode a numeric variable with values 0, 1, 2, 3, 4 to 0, 1, 1, 1, 2
d[, x := 1 * (x %in% 1:3) + 2 * (x == 4)]
# Recode a series of conditions to a factor variable whose value is taken
# from the last condition that is TRUE using Hmisc::score.binary
# Result is a factor variable unless you add retfactor=FALSE
d[, x := score.binary(x1 == 'symptomatic', x2 %in% .q(stroke, MI), death)]
# Same but code with numeric points
d[, x := score.binary(x1 == 'symptomatic', x2 %in% .q(stroke, MI),
                      death,   # TRUE/FALSE or 1/0 variable
                      points=c(1, 2, 10))]
# Recode from one set of character strings to another using a named vector
# A named vector can be subscripted with character strings as well as integers
states <- c(AL='Alabama', AK='Alaska', AZ='Arizona', ...)
# Could also do:
# states <- .q(AL=Alabama, AK=Alaska, AZ=Arizona, ..., NM='New Mexico', ...)
# or do a merge for table lookup (see later)
d[, State := states[state]]
# states are unique, state can have duplicates and all are recoded
# Recode from integers 1, 2, ..., to character strings
labs <- .q(elephant, giraffe, dog, cat)
d[, x := labs[x]]
# Recode from character strings to integers
d[, x := match(x, labs)]
```
Computing Summary Statistics
Many applications can use the automatically created data.table object .SD, which stands for the data table for the current group being processed. If .SDcols is not specified, all variables are analyzed (or an attempt is made to analyze them). Specify a vector of variable names as .SDcols to restrict the analysis. If there are no by variables, .SD stands for the whole data table.
```r
# Compute the number of distinct values for all variables
nd <- function(x) length(unique(x))
d[, sapply(.SD, nd)]
```
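.SDcols may also be a pattern or a predicate function that selects variables:

```r
# Same but only for variables whose names contain hx and either D or M
d[, sapply(.SD, nd), .SDcols=patterns('hx', 'D|M')]
# Compute means on all numeric variables
mn <- function(x) mean(x, na.rm=TRUE)
d[, lapply(.SD, mn), .SDcols=is.numeric]
# Compute means on all numeric non-binary variables
nnb <- function(x) is.numeric(x) && length(unique(x)) > 2
d[, lapply(.SD, mn), .SDcols=nnb]
# Print frequency tables of all categorical variables with > 2 levels
cmult <- function(x) ! is.numeric(x) && length(unique(x)) > 2
tab <- function(x) {
  z <- table(x, useNA='ifany')
  paste(paste0(names(z), ': ', z), collapse=', ')
}
d[, lapply(.SD, tab), .SDcols=cmult]
```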
Tabulate all variables having between 3 and 10 distinct values. While data.table runs, the summarization function tab has the side effect of storing all values and frequencies in a growing list Z, so that kable can render a markdown table after we pad the columns to the maximum length over all columns (the maximum number of distinct values).
```r
# Using <<- makes data.table have a side effect of augmenting Z and
# Align in the global environment
tab <- function(x) {
  z <- table(x, useNA='ifany')
  i <- length(Z)
  Z[[i + 1]] <<- names(z)
  Z[[i + 2]] <<- as.vector(z)
  Align      <<- c(Align, if(is.numeric(x)) 'r' else 'l', 'r')
  length(z)
}
discr <- function(x) { i <- length(unique(x)); i > 2 & i < 11 }
Z     <- list(); Align <- character(0)
w     <- d[, lapply(.SD, tab), .SDcols=discr]
maxl  <- max(w)
# Pad shorter vectors with blanks
Z <- lapply(Z, function(x) c(x, rep('', maxl - length(x))))
Z <- do.call('cbind', Z)   # combine all into columns of a matrix
colnames(Z) <- rep(names(w), each=2)
colnames(Z)[seq(2, ncol(Z), by=2)] <- 'Freq'
knitr::kable(Z, align=Align)
```
| dose | Freq | dobdose | Freq | hxofCig | Freq | ecg | Freq |
|---:|---:|---:|---:|:---|---:|:---|---:|
| 10 | 2 | 5 | 7 | heavy | 122 | normal | 311 |
| 15 | 28 | 10 | 7 | moderate | 138 | equivocal | 176 |
| 20 | 47 | 15 | 55 | non-smoker | 298 | MI | 71 |
| 25 | 56 | 20 | 73 | | | | |
| 30 | 64 | 25 | 71 | | | | |
| 35 | 61 | 30 | 78 | | | | |
| 40 | 300 | 35 | 62 | | | | |
| | | 40 | 205 | | | | |
A better approach is to let the kables function put together a series of separate markdown tables of different sizes. By using the "updating Z in the global environment" side effect, we let data.table output objects of any type with non-conformable dimensions across variables (such as frequency tabulations).
```r
tab <- function(x) {
  z <- table(x, useNA='ifany')
  i <- length(Z)
  w <- matrix(cbind(names(z), z), ncol=2,
              dimnames=list(NULL, c(vnames[i + 1], 'Freq')))
  Z[[i + 1]] <<- knitr::kable(w, align=c(if(is.numeric(x)) 'r' else 'l', 'r'))
  length(z)
}
discr  <- function(x) { i <- length(unique(x)); i > 2 & i < 11 }
Z      <- list()
vnames <- names(d[, .SD, .SDcols=discr])
w      <- d[, lapply(.SD, tab), .SDcols=discr]
knitr::kables(Z)
```
| dose | Freq |
|---:|---:|
| 10 | 2 |
| 15 | 28 |
| 20 | 47 |
| 25 | 56 |
| 30 | 64 |
| 35 | 61 |
| 40 | 300 |

| dobdose | Freq |
|---:|---:|
| 5 | 7 |
| 10 | 7 |
| 15 | 55 |
| 20 | 73 |
| 25 | 71 |
| 30 | 78 |
| 35 | 62 |
| 40 | 205 |

| hxofCig | Freq |
|:---|---:|
| heavy | 122 |
| moderate | 138 |
| non-smoker | 298 |

| ecg | Freq |
|:---|---:|
| normal | 311 |
| equivocal | 176 |
| MI | 71 |
Use a similar side-effect approach to get separate html describe output by gender.
```r
g <- function(x, by) {
  Z[[length(Z) + 1]] <<- describe(x, descript=paste('age for', by))
  by
}
Z  <- list()
by <- d[, g(age, gender), by=gender]
# Make Z look like describe() output for multiple variables
class(Z) <- 'describe'
attr(Z, 'dimensions') <- c(nrow(d), nrow(by))
attr(Z, 'descript')   <- 'Age by Gender'
html(Z)
```
Age by Gender

2 Variables, 558 Observations

age for male: Age [years]

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 220 | 0 | 51 | 0.999 | 67.86 | 12.91 | 45.95 | 51.00 | 62.00 | 69.00 | 75.00 | 81.00 | 86.00 |

lowest: 30 34 38 40 43;  highest: 88 89 91 92 93

age for female: Age [years]

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 338 | 0 | 58 | 0.999 | 67.01 | 13.74 | 47.00 | 50.70 | 59.25 | 68.00 | 76.00 | 83.00 | 85.00 |

lowest: 26 28 29 33 34;  highest: 87 88 89 90 91
```r
# Compute a 1-valued statistic on multiple variables, by cross-classification
# of two variables.  Do this on a subset.  .SDcols=a:b uses variables in order
# Use keyby instead of by to order output the usual way
d[age < 70, lapply(.SD, mean), keyby=.(gender, newMI), .SDcols=pkhr:dp]
```
```r
# Compute multiple statistics on one variable
# Note: .N is a special variable: count of obs for current group
d[, .(Max=max(bhr), Min=min(bhr), Mean=mean(bhr), N=.N), by=.(gender, newMI)]
```
gender newMI Max Min Mean N
1: male 0 127 42 74.15385 208
2: male 1 89 59 71.25000 12
3: female 0 210 45 75.96894 322
4: female 1 108 65 79.43750 16
```r
# Same result another way
g <- function(x) list(Max=max(x), Min=min(x), Mean=mean(x), N=length(x))
d[, g(bhr), by=.(gender, newMI)]  # if g returned a vector instead, use as.list(g(bhr))
```
gender newMI Max Min Mean N
1: male 0 127 42 74.15385 208
2: male 1 89 59 71.25000 12
3: female 0 210 45 75.96894 322
4: female 1 108 65 79.43750 16
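Quantiles can be computed for one variable or for several variables at once:

```r
d[, as.list(quantile(bhr)), by=gender]
# Compute multiple statistics on multiple variables
d[, lapply(.SD, quantile), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
```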
```r
# Similar but put percentile number in front of statistic value
# Do only quartiles
g <- function(x) {
  z <- quantile(x, (1:3)/4, na.rm=TRUE)
  paste(format(names(z)), format(round(z)))
}
d[, lapply(.SD, g), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
# To have more control over labeling and to have one row per sex:
g <- function(x) {
  s <- sapply(x, quantile, na.rm=TRUE)  # compute quantiles for all variables -> matrix
  h <- as.list(s)                       # vectorizes first
  # Cross row names (percentile names) with column (variable) names
  # paste(b, a) puts variable name in front of percentile
  names(h) <- outer(rownames(s), colnames(s), function(a, b) paste(b, a))
  h
}
d[, g(.SD), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
# Can put ! in front of a sequence of variables to do the opposite
# To add duplicated means to raw data use e.g.
# d[, Mean := mean(x), by=sex]
```
Merging Data
Consider a baseline dataset b and a longitudinal dataset L, with subject ID of id.
For more information see [this](https://hbiostat.org/R/data.table), [this](https://rdrr.io/cran/data.table/man/data.table.html), [this](https://medium.com/analytics-vidhya/r-data-table-joins-48f00b46ce29), [this](https://stackoverflow.com/questions/15170741), and [this](https://stackoverflow.com/questions/13493124). To merge any number of datasets at once and obtain a printed report of how the merge went, use the Hmisc Merge function.
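First create the two example data tables and run the basic joins:

```r
b <- data.table(id=1:4, age=c(21, 28, 32, 23), key='id')
L <- data.table(id  = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),
                day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
                y   = 1:12, key='id')
b
L
# Merge b and L to look up baseline age and associate it with all follow-ups
b[L, on=.(id)]                # keeps all ids in L
L[b, on=.(id)]                # keeps all ids in b
L[b, on=.(id), nomatch=NULL]  # keeps only ids in both b and L
```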
```r
uid <- unique(c(b[, id], L[, id]))
L[b[.(uid), on=.(id)]]        # keeps ids in either b or L (full outer join)
```
id day y age
1: 1 NA NA 21
2: 2 1 1 28
3: 2 2 2 28
4: 2 3 3 28
5: 3 1 4 32
6: 3 2 5 32
7: 3 3 6 32
8: 3 4 7 32
9: 4 1 8 23
10: 4 2 9 23
11: 5 1 10 NA
12: 5 2 11 NA
13: 5 3 12 NA
merge(b, L, by='id', all=TRUE) # also full outer join; calls merge.data.table
id age day y
1: 1 21 NA NA
2: 2 28 1 1
3: 2 28 2 2
4: 2 28 3 3
5: 3 32 1 4
6: 3 32 2 5
7: 3 32 3 6
8: 3 32 4 7
9: 4 23 1 8
10: 4 23 2 9
11: 5 NA 1 10
12: 5 NA 2 11
13: 5 NA 3 12
For very large data tables, giving the data tables keys will speed execution, e.g.:
setkey(d, id)
setkey(d, state, city)
Join/merge can be used for data lookups:
```r
s <- data.table(st=.q(AL, AK, AZ, CA, OK), y=5:1)
stateAbbrevs <- data.table(state=state.abb, State=state.name)
s[stateAbbrevs, , on=.(st=state), nomatch=NULL]
```
st y State
1: AL 5 Alabama
2: AK 4 Alaska
3: AZ 3 Arizona
4: CA 2 California
5: OK 1 Oklahoma
Reshaping Data
To reshape data from long to wide format, take the L data table above as an example.
```r
w <- dcast(L, id ~ day, value.var='y')
w
```
id 1 2 3 4
1: 2 1 2 3 NA
2: 3 4 5 6 7
3: 4 8 9 NA NA
4: 5 10 11 12 NA
```r
# Now reverse the procedure
m <- melt(w, id.vars='id', variable.name='day', value.name='y')
m
```
id day y
1: 2 1 1
2: 3 1 4
3: 4 1 8
4: 5 1 10
5: 2 2 2
6: 3 2 5
7: 4 2 9
8: 5 2 11
9: 2 3 3
10: 3 3 6
11: 4 3 NA
12: 5 3 12
13: 2 4 NA
14: 3 4 7
15: 4 4 NA
16: 5 4 NA
Other Manipulations of Longitudinal Data

For the longitudinal data table L, carry forward to 4 days the subject's last observation on y if it was assessed earlier than day 4.
```r
w <- copy(L)   # fresh start with no propagation of changes back to L
# Only needed if will be using := to compute variables in-place and
# you don't want the new variables also added to L
# This is related to data.table doing things by reference instead of
# making copies.  w <- L does not create new memory for w.
# Compute each day's within-subject record number and last record number
# Feed this directly into a data.table operation to save last records
# when the last is on a day < 4
u <- w[, .q(seq, maxseq) := .(1 : .N, .N), by=id][seq == maxseq & day < 4, ]
# Extra observations to fill out to day 4
u <- u[, .(day = (day + 1) : 4, y = y), by=id]
u
```
id day y
1: 2 4 3
2: 4 3 9
3: 4 4 9
4: 5 4 12
```r
w <- rbind(L, u, fill=TRUE)
setkey(w, id, day)   # sort and index
w
```
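Find the first time at which y >= 3 and at which y >= 7.

```r
L[, .(first3 = min(day[y >= 3]),
      first7 = min(day[y >= 7])), by=id]
```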
Same but instead of resulting in an infinite value if no observations for a subject meet the condition, make the result NA.
```r
mn <- function(x) if(length(x)) min(x) else as.double(NA)
# as.double needed because day is stored as double precision
# (type contents(L) to see this) and data.table requires
# consistent storage types
L[, .(first3 = mn(day[y >= 3]),
      first7 = mn(day[y >= 7])), by=id]
```
id first3 first7
1: 2 3 NA
2: 3 1 4
3: 4 1 1
4: 5 1 1
Add a new variable z and compute the first day at which z is above 0.5 for two days in a row for the subject. Note that the logic below looks for consecutive days for which records exist for a subject. To also require the days to be one day apart add the clause day == shift(day) + 1 after shift(z) > 0.5.
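Here z is added as a rounded random uniform variable and the two derived variables are computed, producing the output below:

```r
set.seed(1)
L[, z := round(runif(.N), 3)]
mn <- function(x)
  if(! length(x) || all(is.na(x))) as.double(NA) else min(x, na.rm=TRUE)
L[, consecutive := z > 0.5 & shift(z) > 0.5, by=id][
   , firstday := mn(day[consecutive]), by=id]
L
```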
id day y z consecutive firstday
1: 2 1 1 0.266 FALSE NA
2: 2 2 2 0.372 FALSE NA
3: 2 3 3 0.573 FALSE NA
4: 3 1 4 0.908 NA 4
5: 3 2 5 0.202 FALSE 4
6: 3 3 6 0.898 FALSE 4
7: 3 4 7 0.945 TRUE 4
8: 4 1 8 0.661 NA 2
9: 4 2 9 0.629 TRUE 2
10: 5 1 10 0.062 FALSE NA
11: 5 2 11 0.206 FALSE NA
12: 5 3 12 0.177 FALSE NA
Analysis
Formatting
For analysis the sky is the limit. I take advantage of special formatting for model fit objects from the rms package by using html or latex methods and putting results='asis' in the chunk header to preserve the special formatting. Here is an example.
```r
require(rms)
options(prType='html')   # needed to use special formatting (can use prType='latex')
set.seed(1)
n <- 1000
w <- data.table(x=runif(n), x2=rnorm(n), y=sample(0:1, n, replace=TRUE))
dd <- datadist(w); options(datadist='dd')   # rms needs this for summaries and plotting
f <- lrm(y ~ x + rcs(x2, 4), data=w)
f
```
Logistic Regression Model

lrm(formula = y ~ x + rcs(x2, 4), data = w)

| | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|:---|:---|:---|:---|
| Obs 1000 | LR χ² 1.91 | R² 0.003 | C 0.526 |
| 0 492 | d.f. 4 | R²(4,1000) 0.000 | Dxy 0.051 |
| 1 508 | Pr(>χ²) 0.7516 | R²(4,749.8) 0.000 | γ 0.051 |
| max \|∂log L/∂β\| 2×10⁻¹⁴ | | Brier 0.249 | τa 0.026 |

| | β | S.E. | Wald Z | Pr(>\|Z\|) |
|:---|---:|---:|---:|---:|
| Intercept | 0.1079 | 0.2976 | 0.36 | 0.7169 |
| x | 0.1506 | 0.2197 | 0.69 | 0.4931 |
| x2 | 0.0502 | 0.2142 | 0.23 | 0.8147 |
| x2' | -0.3705 | 0.6631 | -0.56 | 0.5763 |
| x2'' | 1.3624 | 2.6219 | 0.52 | 0.6033 |
anova(f)
Wald Statistics for y

| | χ² | d.f. | P |
|:---|---:|---:|---:|
| x | 0.47 | 1 | 0.4931 |
| x2 | 1.44 | 3 | 0.6961 |
| Nonlinear | 0.33 | 2 | 0.8499 |
| TOTAL | 1.91 | 4 | 0.7523 |
Caching
The workhorse behind Rmarkdown and Quarto is knitr, which processes the code chunks and properly mingles code with tabular and graphical output. knitr has a built-in caching mechanism so that code is not needlessly re-executed when its inputs have not changed. This easy-to-use process has two disadvantages: the dependencies are not transparent, and the stored cache files may be quite large. I like to take control of caching, and the runifChanged function was written to that end. Here is an example of its use. First, compose a function with no arguments. This is the (usually slow) function that will be run only if any of a group of listed objects has changed since the last time it was run. When it does need to be run, it produces an object that is stored in binary form in a user-specified file (the default file name is the name of the current R code chunk with .rds appended).
```r
# Read the source code for the hashCheck and runifChanged functions from
# https://github.com/harrelfe/rscripts/blob/master/hashCheck.r
getRs('hashCheck.r', put='source')
g <- function() {
  # Fit a logistic regression model and bootstrap it 500 times, saving
  # the matrix of bootstrapped coefficients
  f <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE, data=dat)
  bootcov(f, B=500)
}
set.seed(3)
n   <- 2000
dat <- data.table(x1=runif(n), x2=runif(n),
                  y=sample(0:1, n, replace=TRUE))
# runifChanged will write runifch.rds if needed (chunk name.rds)
# Will run if dat or the source code for lrm or bootcov changes
b <- runifChanged(g, dat, lrm, bootcov)
dim(b$boot.Coef)
```
Parallel Computing

The runParallel function makes it easy to use the available multiprocessor cores to speed up computations, especially simulations. By default it uses one less than the number of available cores. runParallel makes the parallel package easier to use and recombines results over the per-core batches. The user writes a function that does the work on one core, and that same function is run on all cores. This function has a fixed set of arguments and must return a named list. A base random number seed is given, and the seed is set to this plus i for core number i. The total number of repetitions is given, and the most balanced possible number of repetitions is run on each core so that the per-core counts sum to the total desired number of iterations. runifChanged is again used, to avoid running the simulations if no inputs have changed.
```r
# Load runParallel function from github
getRs('runParallel.r', put='source')
# Function to do simulations on one core
run1 <- function(reps, showprogress, core) {
  cof <- matrix(NA, nrow=reps, ncol=3,
                dimnames=list(NULL, .q(a, b1, b2)))
  for(i in 1 : reps) {
    y <- sample(0:1, n, replace=TRUE)
    f <- lrm(y ~ X)
    cof[i, ] <- coef(f)
  }
  list(coef=cof)
}
# Debug one core run, with only 3 iterations
n    <- 300
seed <- 3
set.seed(seed)
X <- cbind(x1=runif(n), x2=runif(n))   # condition on covariates
run1(3)
```
The following output is created by the command markupSpecs$html$session(), where markupSpecs is defined in the Hmisc package.
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 21.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rms_6.3-1 SparseM_1.81 data.table_1.13.6 Hmisc_4.7-1
[5] ggplot2_3.3.3 Formula_1.2-4 survival_3.2-13 lattice_0.20-45
To cite R in publications use:
R Core Team (2022).
R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria.
https://www.R-project.org/.
---title: R Workflowauthor: - name: Frank Harrell url: https://hbiostat.orgdate: last-modifiedformat: html: self-contained: true anchor-sections: true code-tools: true code-fold: false code-link: false fig-width: 8 fig-height: 4 code-block-bg: "#f1f3f5" code-block-border-left: "#31BAE9" mainfont: Source Sans Pro theme: journal toc: true toc-depth: 3 toc-location: left captions: true cap-location: margin table-captions: true tbl-cap-location: margin reference-location: margin include-in-header: /home/harrelfe/doc/html/ganalytics.htmlcomments: hypothesis: trueexecute: warning: false message: false---<!-- To compile: quarto render rflow.qmd -->```{r include=FALSE}source('~/r/rmarkdown/qbookfun.r')```[`r hypcomment`]{.aside}This article outlines analysis project workflow that I've found to be efficient in making reproducible research reports using R with `Rmarkdown` and now `Quarto`. I start by covering the creation of annotated analysis files and running descriptive statistics on them with goals of understanding the data and the quality and completeness of the data. Functions in the `Hmisc` package are used to annotate data frames and data tables with labels and units of measurement and to produce tabular and graphical statistical summaries. Several examples of processing and manipulating data using the `data.table` package are given. Finally, examples of caching results and doing parallel processing are presented.# File Directory StructureI have a directory for each major project, and put everything in thatdirectory (including data files) except for graphics files forfigures, which are placed in their own subdirectory underneath theproject folder.[With `Quarto` I specify that `html` files are to be self-contained, so there are no separate graphics files.]{.aside} The directory name for the project includes keyidentifying information, and files within that directory do notcontain project information in their names, nor do they contain dates,unless I want to freeze an old version of an analysis script.For multi-analyst projects or ones in which you want to capture the entire code history, having the project on `github` is worthwhile.# Analysis File CreationI typically create a compact analysis file in a separate R scriptcalled `create.r` and have it produce compressed R binary `data.frame`or `data.table``.rds` files using `saveRDS(name, 'name.rds', compress='xz')`. Then I have an analysis script named for example `a.qmd` (for [Quarto](https://quarto.org) reports) or `a.Rmd` (for RMarkdown reports) that starts with `d <- readRDS('name.rds')`. [Templates for analysis reports are [here](http://hbiostat.org/rr/index.html#template), and a comprehensive report example may be found [here](https://hbiostat.org/R/Hmisc/markov).]{.aside}When variables need to be recoded, have labels added or changed, orhave units of measurement added[Variable labels and units of measurement are used in special ways in my R packages. This will show up in the `describe` and `contents` function outputs below and in axis labels for graphics.]{.aside}, I specify those using the `Hmisc` package [`upData`function](https://www.rdocumentation.org/packages/Hmisc/versions/4.7-0/topics/upData).To facilitate some operations requiring variable names to be quoted, define a function `.q` to quote them automatically. `.q` is like the `Hmisc` function `Cs` but also allows elements to be named. It will be in `Hmisc` 4.7-1.```{r quote}.q <-function(...) 
{ s <-sys.call()[-1] w <-as.character(s) n <-names(s)if(length(n)) names(w) <- n w}.q(a, b, c, 'this and that').q(dog=a, giraffe=b, cat=c)```Here is an `upData` example:```{r eval=FALSE}# Function to recode from atypical coding for yes/no in raw datayn <-function(x) factor(x, 0:1, c('yes', 'no'))d <-upData(d,rename =.q(gender=sex, any.event=anyEvent),posSE =yn(posSE),newMI =yn(newMI),newPTCA =yn(newPTCA),newCABG =yn(newCABG),death =yn(death),hxofHT =yn(hxofHT),hxofDM =yn(hxofDM),hxofCig =factor(hxofCig, c(0, 0.5, 1),c('heavy', 'moderate', 'non-smoker')), hxofMI =yn(hxofMI),hxofPTCA =yn(hxofPTCA),hxofCABG =yn(hxofCABG),chestpain=yn(chestpain),anyEvent =yn(anyEvent),drop=.q(event.no, phat, mics, deltaEF, newpkmphr, gdpkmphr, gdmaxmphr, gddpeakdp, gdmaxdp, hardness),labels=c(bhr ='Basal heart rate',basebp ='Basal blood pressure',basedp ='Basal Double Product bhr*basebp',age ='Age',pkhr ='Peak heart rate',sbp ='Systolic blood pressure',dp ='Double product pkhr*sbp',dose ='Dose of dobutamine given',maxhr ='Maximum heart rate',pctMphr ='Percent maximum predicted heart rate achieved',mbp ='Maximum blood pressure',dpmaxdo ='Double product on max dobutamine dose',dobdose ='Dobutamine dose at max double product',baseEF ='Baseline cardiac ejection fraction',dobEF ='Ejection fraction on dobutamine', chestpain ='Chest pain', ecg ='Baseline electrocardiogram diagnosis',restwma ='Resting wall motion abnormality on echocardiogram', posSE ='Positive stress echocardiogram',newMI ='New myocardial infarction',newPTCA ='Recent angioplasty',newCABG ='Recent bypass surgery', hxofHT ='History of hypertension', hxofDM ='History of diabetes',hxofMI ='History of myocardial infarction',hxofCig ='History of smoking',hxofPTCA ='History of angioplasty',hxofCABG ='History of coronary artery bypass surgery',anyEvent ='Death, newMI, newPTCA, or newCABG'),units=.q(age=years, bhr=bpm, basebp=mmHg, basedp='bpm*mmHg',pkhr=mmHg, sbp=mmHg, dp='bpm*mmHg', maxhr=bpm,mbp=mmHg, dpmaxdo='bpm*mmHg', baseEF='%', dobEF='%',pctMphr='%', dose=mg, dobdose=mg) )saveRDS(d, 'stressEcho.rds', compress='xz')# Note that we could have automatically recoded all 0:1 variables# if they were all to be treated identically:for(x innames(d)) if(all(d[[x]] %in%c(0,1))) d[[x]] <-yn(d[[x]])```::: {.callout-note collapse="true"}## For REDCap UsersIf reading data exported from `REDCap` that are placed into theproject directory I run the following to get rid of duplicate(`factor` and non-`factor` versions of variables `REDCap` produces)variables and automatically convert dates to `Date` variables:```require(Hmisc)getRs('importREDCap.r', put='source') # source() code to define functionmydata <- importREDCap() # by default operates on last downloaded exportsaveRDS(mydata, 'mydata.rds', compress='xz')```When file names are not given to `importREDCap` the function looks forthe latest created `.csv` file and `.R` file with same prefix and uses those.See [this](http://hbiostat.org/bbr/md/r.html) for more information.:::## Variable NamingI prefer short but descriptive variable names. As exemplified above, I use variable `label`s and `units` to provide more information. For example I wouldn't name the age varible `age.at.enrollment.yrs` but would name it `age` with a label of `Age at Enrollment` and with `units` of `years`. Short, clear names unclutter code, especially formulas in statistical models. 
One can always fetch a variable label while writing a program (e.g., typing `label(d$age)` at the console) to check that you have the right variable (or put the data dictionary in a window for easy reference, as shown below). [`Hmisc` package graphics and table making functions such as `summaryM` and `summary.formula` specially typeset `units` in a smaller font.]{.aside}## Data DictionaryThe `Hmisc` package `contents` function will provide a concise data dictionary. Here is an example using the permanent version[The permanent version of `stressEcho` coded binary variables as 0/1 instead of N/Y.]{.aside} of the dataset created above, which can be accessed with the `Hmisc``getHdata` function. The top of the `contents` output has the number of levels for `factor` variables hyperlinked. Click on the number to go directly to the list of levels for that variable.```{r getdat}require(Hmisc)getHdata(stressEcho)d <- stressEcho```::: {.callout-note collapse="true"}# Data Dictionary```{r ddict2,results='asis'}html(contents(d), levelType='table')```:::You can write the text output of `contents` into a text file in your current working directory, and click on that file in the `RStudio``Files` window to create a new tab in the editor panel where you can view the data dictionary at any time. This is especially helpful if you need a reminder of variable definitions that are stored in the variable labels. Here is an example where the formatted data dictionary is saved. [Users having the `xless` system command installed can pop up a `contents` window at any time by typing `xless(contents(d))` in the console. `xless` is in `Hmisc`.]{.aside}```{r,eval=FALSE}capture.output(contents(d), file='contents.txt')```Or put the html version of the data dictionary into a small browser window to which you can refer at any point in analysis coding.```{r eval=FALSE}cat(html(contents(d)), file='contents.html')browseURL('contents.html', browser='vivaldi -new-window')````RStudio` provides a nice way to do this, facilitated by the `htmlView` function which may be fetched from `Github` using the `Hmisc``getRs` function. `htmlView` takes any number of objects for which an `html` method exists to render them. They are rendered in the `RStudio``Viewer` pane. If you are running outside `RStudio`, your default browser will be launched instead.[Occasionally `RStudio Viewer` will drop its arrow button making it impossible to navigate back and forth to different `html` outputs.]{.aside}```{r eval=FALSE}getRs('htmlView.r', put='source')htmlView(contents(d))```In some cases it is best to have a browser window open to the full descriptive statistics for a data table/frame (see below; the `describe` function also shows labels, units, and levels). [For either approach it would be easy to have multiple tabs open, one tab for each of a series of data tables, or use `htmlView`.]{.aside}# Descriptive StatisticsThe `Hmisc``describe` function is my main tool for getting initial descriptive statistics and quality controlling the data in a univariate fashion. Here is an example. The `Info` index is a measure of the information content in a numeric variable relative to the information in a continuous numeric variable with no ties. A very low value of `Info` will occur when a highly imbalanced variable is binary. Clicking on `Glossary` on the right will pop up a browser window with a more in-depth glossary of terms used in `Hmisc` package output. 
It links to `hbiostat.org/R/glossary.html` which you can link from your reports that use `Hmisc`.[`Info` comes from the [approximate formula](http://hbiostat.org/bib/r2.html) for the variance of a log odds ratio for a proportional odds model/Wilcoxon test, due to Whitehead.<br><a href="https://hbiostat.org/comment.html" target="popup" onclick="window.open('https://hbiostat.org/R/glossary.html#describe', 'popup', 'width=450,height=600'); return false;"><small>Glossary</small></a>]{.aside}.`Gmd` in the output stands for Gini's mean difference---the mean absolute difference over all possible pairs of different observations. It is a very interpretable measure of dispersion that is more robust than the standard deviation.::: {.callout-note .column-page collapse="true"}# `describe` Output```{r desc,results='asis'}w <-describe(d)html(w)```:::```{r eval=FALSE}# To create a separate browser window:cat(html(w), file='desc.html')browseURL('desc.html', browser='firefox -new-window')```Better, whether using `RStudio` or not:```{r eval=FALSE}htmlView(w, contents(d)) # or htmlView(describe(d1), describe(d2), ...)```There is also a `plot` method for `describe` output. It produces two graphics objects: one for categorical variables and one for continuous variables. The default is to use `ggplot2` to produce static graphics.```{r pdesc}p <-plot(w, bvspace=2.5)```::: {.panel-tabset}## Categorical Variables```{r pdesccat}p$Categorical```## Continuous Variables```{r pdesccon}p$Continuous```:::To make creation of tabs in `Quarto` easier, here is a function that loops through the elements of a named list and outputs each element into a separate `Quarto` tab. A `wide` argument is used to expand the width of the output outside the usual margins. An `initblank` argument creates a first tab that is empty. This allows one to show nothing until one of the other tabs is clicked. [This function is in `Github` and can be defined using the `Hmisc` `getRs` function, i.e., `getRs('maketabs.r', put='source')`.]{.aside}```{r maketabs,results='asis'}maketabs <-function(x, labels=names(x), wide=FALSE, initblank=FALSE) { yaml <-paste0('.panel-tabset', if(wide) ' .column-page') k <-c('', '::: {', yaml, '}', '')if(initblank) k <-c(k, '', '## ', '') .objmaketabs. <<- xfor(i in1:length(x)) { cname <-paste0('c', round(100000*runif(1))) k <-c(k, '', paste('##', labels[i]), '',paste0('```{r ', cname, ',results="asis",echo=FALSE}'),paste0('.objmaketabs.[[', i, ']]'), '```', '') } k <-c(k, ':::', '')cat(knitr::knit(text=knitr::knit_expand(text=k), quiet=TRUE))}maketabs(p)```By specifying `grType` option you can instead get `plotly` graphics that use hover text to show more information, especially when hovering over the leftmost dot or tick mark for a variable.```{r pldesc,results='asis'}options(grType='plotly')p <-plot(w, bvspace=2.5)maketabs(p, wide=TRUE)```See [this](http://hbiostat.org/R/Hmisc/examples.html) for other `Hmisc` functions for descriptive graphics and tables, especially for stratified descriptive statistics for categorical variables. The `summaryM` function prints a tabular summary of a mix of continuous and categorical variables. Here is an example where stratification is by history of myocardial infarction (MI). 
[The `initblank` option to the `maketabs` function creates the initially opened tab, which has no content, thereby leaving it to the user to click one of the other tabs before output is shown.]{.aside}```{r summaryM,results='asis'}require(data.table)setDT(d) # turn d into a data tablew <- dw[, hxofMI :=factor(hxofMI, 0:1, c('No history of MI', 'History of MI'))]vars <-setdiff(names(d), 'hxofMI')form <-as.formula(paste(paste(vars, collapse='+'), '~ hxofMI'))print(form)s <-summaryM(form, data=d, test=TRUE)w <-list('Table 1'=html(s, exclude1=TRUE, npct='both', digits=3, middle.bold=TRUE),'Categorical Variable Plot'=plot(s, which='categorical', vars=1:4, height=600, width=1000),'Continuous Variable Plot'=plot(s, which='continuous', vars=1:4))# initblank=TRUE: Put an empty tab first so that nothing initially displaysmaketabs(w, wide=TRUE, initblank=TRUE)```Semi-interactive stratified spike histograms are also useful descriptive plots. These plots also contain a superset of the quantiles used in box plots, and the legend is clickable, allowing any of the statistical summaries to be turned off.```{r spikeh}d[, histboxp(x=maxhr, group=ecg, bins=200)]```# Data Manipulation and AggregationThe [`data.table`](https://hbiostat.org/R/data.table) package provides a concise, consistent syntax for managing simple and complex data manipulation tasks, and it is extremely efficient for large datasets. One of the best organized tutorials is [this](https://riptutorial.com/data-table), and a cheatsheet for data transformation is [here](https://raw.githubusercontent.com/rstudio/cheatsheets/master/datatable.pdf). A master cheatsheet for `data.table` is [here](https://rdrr.io/cran/data.table/man/data.table.html) from which the general syntax below is taken, where `DT` represents a data table.``` DT[ i, j, by ] # + extra arguments | | | | | -------> grouped by what? | -------> what to do? ---> on which rows?```Data tables are also data frames so work on any method handling data frames. But data tables do not contain `rownames`.Several `data.table` examples follow. I like to hold the current dataset in `d` to save typing. ## Analyzing Selected Variables and Subsets```{r dtsimple,results='asis'}d[, html(describe(age))]d[, html(describe(~ age + gender))]d[gender =='female', html(describe(age))] # analyze age for femaleshtml(describe(d[, .(age, gender)], 'Age and Gender Stats'))``````{r dtsimple2}# Separate analysis by female, male. Use keyby instead of by to sort the usual way.d[, print(describe(age, descript=gender)), by=gender]# Compute mean and median age by genderd[, .(Mean=mean(age), Median=median(age)), by=gender]# To create a new subsetw <- d[gender =='female'& age <70, ]```## Adding or Changing VariablesWith `data.table` you can create a new data table with added variables, or you can add or redefine variables in a data table in place. The latter has major speed and memory efficiency implications when processing massive data tables. 
Here `d` refers to a different data table from the one used above.```{r eval=FALSE}# Rename a variablesetnames(d, c('gender', 'height'),c( 'sex', 'ht'))# Easier:setnames(d, .q(gender, height),.q( sex, ht))# Orren <-.q(gender=sex, height=ht)setnames(d, names(ren), ren)# Orrename <-function(x, n) setnames(x, names(n), n)rename(d, .q(gender=sex, height=ht))# For all variables having baseline_ at the start of their names, remove itn <-names(d)setnames(d, n, sub('^baseline_', '', n)) # ^ = at start; use $ for at end# For all variables having baseline_ at the start of their names, remove it# and add Baseline to start of variable labeln <-names(d)n <- n[grepl('^baseline_', n)]ren <-sub('^baseline_', '', n); names(ren) <- n# Fetch vector of labels for variables whose names start with baseline_labs <-sapply(d, label)[n] # label() result is "" if no labellabs <-paste('Baseline', labs)d <-updata(d, rename=ren, labels=labs)# Change all names to lower casen <-names(d)setnames(d, n, tolower(n))# Abbreviate names of all variables longer than 10 characters# abbreviate() will ensure that all names are uniquen <-names(d)setnames(d, n, abbreviate(n, minlength=10))# For any variables having [...] or (...) in their labels, assume these# are units of measurement and move them from the label to the units# attributed <-upData(d, moveUnits=TRUE)# Add two new variables, storingresult in a new data tablez <- d[, .(bmi=wt / ht ^2, bsa=0.016667*sqrt(wt * ht))]# Add one new variable in placed[, bmi := wt / ht ^2]# Add two new variables in placed[, `:=`(bmi = wt / ht ^2, bas=0.016667*sqrt(wt * ht))]d[, .q(bmi, bsa) := .(wt / ht ^2, 0.016667*sqrt(wt * ht))]# Compute something requiring a different formula for different types# of observationsd[, htAdj :=ifelse(sex =='male', ht, ht *1.07)]d[, htAdj := ht *ifelse(sex =='male', 1, 1.07)]d[, htAdj := (sex =='male') * ht + (sex =='female') * ht *1.07]d[sex =='male', htAdj := ht]d[sex =='female', htAdj := ht *1.07]# Add label & optional units (better to use upData which works on data tables)adlab <-function(x, lab, un='') {label(x) <- labif(un !='') units(x) <- un x}d[, maxhr :=adlab(maxhr, 'Maximum Heart Rate', '/m')]# Delete a variable (or use upData)d[, bsa :=NULL]# Delete two variablesd[, `:=`(bsa=NULL, bmi=NULL)]d[, .q(bsa, bmi) :=NULL]```## Recoding Variables```{r, eval=FALSE}# Group levels a1, a2 as a & b1,b2,b3 as bd[, x :=factor(x, .q(a1,a2,b1,b2,b3),.q( a, a, b, b, b))]# Regroup an existing factor variablelevels(d$x) <-list(a=.q(a1,a2), b=.q(b1,b2,b3))# ord <-upData(d, levels=list(x=list(a=.q(a1,a2), b=.q(b1,b2,b3))))# Or manipulate character stringsd[, x :=substring(x, 1, 1)] # replace x with first character of levels# orlevels(d$x) <-substring(levels(d$x), 1, 1)# Recode a numeric variable with values 0, 1, 2, 3, 4 to 0, 1, 1, 1, 2d[, x :=1* (x %in%1:3) +2* (x ==4)]# Recode a series of conditions to a factor variable whose value is taken# from the last condition that is TRUE using Hmisc::score.binary# Result is a factor variable unless you add retfactor=FALSEd[, x :=score.binary(x1 =='symptomatic', x2 %in%.q(stroke, MI), death)]# Same but code with numeric pointsd[, x :=score.binary(x1 =='symptomatic', x2 %in%.q(stroke, MI), death, # TRUE/FALSE or 1/0 variablepoints=c(1,2,10))]# Recode from one set of character strings to another using named vector# A named vector can be subscripted with character strings as well as integersstates <-c(AL='Alabama', AK='Alaska', AZ='Arizona', ...)# Could also do:# states <- .q(AL=Alabama, AK=Alaska, AZ=Arizona, ..., NM='New Mexico', 
...) # or do a merge for table lookup (see later)d[, State := states[state])# states are unique, state can have duplicates and all are recoded# Recode from integers 1, 2, ..., to character stringslabs <-.q(elephant, giraffe, dog, cat)d[, x := labs[x]]# Recode from character strings to integersd[, x :=match(x, labs)]```# Computing Summary StatisticsMany applications can use the automatically created `data.table` object `.SD` which stands for the data table for the current group being processed. If `.SDcols` were not specified, all variables would be attempted to be analyzed. Specify a vector of variable names as `.SDcols` to restrict the analysis. If there were no `by` variable(s), `.SD` stands for the whole data table.```{r ss}# Compute the number of distinct values for all variablesnd <-function(x) length(unique(x))d[, sapply(.SD, nd)]# Same but only for variables whose names contain hx and either D or Md[, sapply(.SD, nd), .SDcols=patterns('hx', 'D|M')]# Compute means on all numeric variablesmn <-function(x) mean(x, na.rm=TRUE)d[, lapply(.SD, mn), .SDcols=is.numeric]# Compute means on all numeric non-binary variablesnnb <-function(x) is.numeric(x) &&length(unique(x)) >2d[, lapply(.SD, mn), .SDcols=nnb]# Print frequency tables of all categorical variables with > 2 levelscmult <-function(x) !is.numeric(x) &&length(unique(x)) >2tab <-function(x) { z <-table(x, useNA='ifany')paste(paste0(names(z), ': ', z), collapse=', ')}d[, lapply(.SD, tab), .SDcols=cmult]```Tabulate all variables having between 3 and 10 distinct values and create a side effect when `data.table` is running that makes the summarization function `tab` store all values and frequencies in a growing list `Z` so that `kable` can render a `markdown` table after we pad columns to the maximum length of all columns (maximum number of distinct values). ```{r diffreq}# Using <<- makes data.table have a side effect of augmenting Z and# Align in the global environmenttab <-function(x) { z <-table(x, useNA='ifany') i <-length(Z) Z[[i+1]] <<-names(z) Z[[i+2]] <<-as.vector(z) Align <<-c(Align, if(is.numeric(x)) 'r'else'l', 'r')length(z)}discr <-function(x) { i <-length(unique(x)); i >2& i <11 }Z <-list(); Align <-character(0)w <- d[, lapply(.SD, tab), .SDcols=discr]maxl <-max(w)# Pad shorter vectors with blanksZ <-lapply(Z, function(x) c(x, rep('', maxl -length(x))))Z <-do.call('cbind', Z) # combine all into columns of a matrixcolnames(Z) <-rep(names(w), each=2)colnames(Z)[seq(2, ncol(Z), by=2)] <-'Freq'knitr::kable(Z, align=Align)```A better approach is to let the `kables` function put together a series of separate `markdown` tables of different sizes. 
By using the "updating `Z` in the global environment" side effect we are able to let `data.table` output any type of objects of non-conformable dimensions over variables (such as frequency tabulations).```{r tabalt}tab <-function(x) { z <-table(x, useNA='ifany') i <-length(Z) w <-matrix(cbind(names(z), z), ncol=2,dimnames=list(NULL, c(vnames[i+1], 'Freq'))) Z[[i+1]] <<- knitr::kable(w, align=c(if(is.numeric(x)) 'r'else'l', 'r'))length(z)}discr <-function(x) { i <-length(unique(x)); i >2& i <11 }Z <-list()vnames <-names(d[, .SD, .SDcols=discr])w <- d[, lapply(.SD, tab), .SDcols=discr]knitr::kables(Z)```Use a similar side-effect approach to get separate `html``describe` output by `gender`.```{r bydesc,results='asis'}g <-function(x, by) { Z[[length(Z) +1]] <<-describe(x, descript=paste('age for', by)) by}Z <-list()by <- d[, g(age, gender), by=gender]# Make Z look like describe() output for multiple variablesclass(Z) <-'describe'attr(Z, 'dimensions') <-c(nrow(d), nrow(by))attr(Z, 'descript') <-'Age by Gender'html(Z)``````{r morestats}# Compute a 1-valued statistic on multiple variables, by cross-classification# of two variables. Do this on a subset. .SDcols=a:b uses variables in order# Use keyby instead of by to order output the usual wayd[age <70, lapply(.SD, mean), keyby=.(gender, newMI), .SDcols=pkhr:dp]# Compute multiple statistics on one variable# Note: .N is a special variable: count of obs for current groupd[, .(Max=max(bhr), Min=min(bhr), Mean=mean(bhr), N=.N), by=.(gender, newMI)]# Same result another wayg <-function(x) list(Max=max(x), Min=min(x), Mean=mean(x), N=length(x))d[, g(bhr), by=.(gender, newMI)] # if g returned a vector instead, use as.list(g(bhr))d[, as.list(quantile(bhr)), by=gender]# Compute multiple statistics on multiple variablesd[, lapply(.SD, quantile), by=gender, .SDcols=.q(bhr, pkhr, sbp)]# Similar but put percentile number in front of statistic value# Do only quartilesg <-function(x) { z <-quantile(x, (1:3)/4, na.rm=TRUE)paste(format(names(z)), format(round(z)))}d[, lapply(.SD, g), by=gender, .SDcols=.q(bhr, pkhr, sbp)]# To have more control over labeling and to have one row per sex:g <-function(x) { s <-sapply(x, quantile, na.rm=TRUE) # compute quantiles for all variables -> matrix h <-as.list(s) # vectorizes first# Cross row names (percentile names) with column (variable) names# paste(b, a) puts variable name in front of percentilenames(h) <-outer(rownames(s), colnames(s), function(a, b) paste(b, a)) h}d[, g(.SD), by=gender, .SDcols=.q(bhr, pkhr, sbp)]# Restrict to variables bhr - basedp in order columns created in data tabled[, g(.SD), by=gender, .SDcols=bhr : basedp]# Can put ! in front of a sequence of variables to do the opposite# To add duplicated means to raw data use e.g.# d[, Mean := mean(x), by=sex]```# Merging DataConsider a baseline dataset `b` and a longitudinal dataset `L`, with subject ID of `id`. [For more information see [this](https://hbiostat.org/R/data.table), [this](https://rdrr.io/cran/data.table/man/data.table.html), [this](https://medium.com/analytics-vidhya/r-data-table-joins-48f00b46ce29), [this](https://stackoverflow.com/questions/15170741) and [this](https://stackoverflow.com/questions/13493124). 
# Merging Data

Consider a baseline dataset `b` and a longitudinal dataset `L`, with subject ID `id`. [For more information see [this](https://hbiostat.org/R/data.table), [this](https://rdrr.io/cran/data.table/man/data.table.html), [this](https://medium.com/analytics-vidhya/r-data-table-joins-48f00b46ce29), [this](https://stackoverflow.com/questions/15170741) and [this](https://stackoverflow.com/questions/13493124). To merge any number of datasets at once and obtain a printed report of how the merge went, use the `Hmisc` `Merge` function.]{.aside}

```{r two}
b <- data.table(id=1:4, age=c(21, 28, 32, 23), key='id')
L <- data.table(id  = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),
                day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
                y   = 1:12, key='id')
b
L
# Merge b and L to look up baseline age and associate it with all follow-ups
b[L, on=.(id)]                 # Keeps all ids in L (looks up b's columns for each row of L)
L[b, on=.(id)]                 # Keeps all ids in b
L[b, on=.(id), nomatch=NULL]   # Keeps only ids in both b and L (inner join)
uid <- unique(c(b[, id], L[, id]))
L[b[.(uid), on=.(id)]]         # Keeps ids in either b or L (full outer join)
merge(b, L, by='id', all=TRUE) # also full outer join; calls merge.data.table
```

For very large data tables, giving the data tables _keys_ will speed execution, e.g.:

```
setkey(d, id)
setkey(d, state, city)
```

Join/merge can be used for data lookups:

```{r lookup}
s <- data.table(st=.q(AL, AK, AZ, CA, OK), y=5:1)
stateAbbrevs <- data.table(state=state.abb, State=state.name)
s[stateAbbrevs, , on=.(st=state), nomatch=NULL]
```

# Reshaping Data

To reshape data from long to wide format, take the `L` data table above as an example.

```{r l2w}
w <- dcast(L, id ~ day, value.var='y')
w
# Now reverse the procedure
m <- melt(w, id.vars='id', variable.name='day', value.name='y')
m
setkey(m, id, day)   # sorts
m[! is.na(y)]
```

# Other Manipulations of Longitudinal Data

For the longitudinal data table `L`, carry the subject's last observation of `y` forward to day 4 if it was assessed earlier than day 4.

```{r}
w <- copy(L)   # fresh start with no propagation of changes back to L
# Only needed if we will be using := to compute variables in-place and
# don't want the new variables also added to L
# This is related to data.table doing things by reference instead of
# making copies.  w <- L does not create new memory for w.
# Compute each day's within-subject record number and last record number
# Feed this directly into a data.table operation to save last records
# when the last is on a day < 4
u <- w[, .q(seq, maxseq) := .(1 : .N, .N), by=id][seq == maxseq & day < 4, ]
# Extra observations to fill out to day 4
u <- u[, .(day = (day + 1) : 4, y = y), by=id]
u
w <- rbind(L, u, fill=TRUE)
setkey(w, id, day)   # sort and index
w
```

Find the first time at which `y >= 3` and at which `y >= 7`.

```{r firstt}
L[, .(first3 = min(day[y >= 3]),
      first7 = min(day[y >= 7])), by=id]
```

Same as above, but instead of producing an infinite value when no observations for a subject meet the condition, make the result `NA`.

```{r firstna}
mn <- function(x) if(length(x)) min(x) else as.double(NA)
# as.double needed because day is stored as double precision
# (type contents(L) to see this) and data.table requires
# consistent storage types
L[, .(first3 = mn(day[y >= 3]),
      first7 = mn(day[y >= 7])), by=id]
```

Add a new variable `z` and compute the first day at which `z` is above 0.5 for two days in a row for the subject. Note that the logic below looks for consecutive days _for which records exist for a subject_. To also require the days to be one day apart, add the clause `day == shift(day) + 1` after `shift(z) > 0.5`; this variant is sketched just after the chunk below.

```{r}
set.seed(1)
L[, z := round(runif(.N), 3)]
L
mn <- function(x)
  if(! length(x) || all(is.na(x))) as.double(NA) else min(x, na.rm=TRUE)
L[, consecutive := z > 0.5 & shift(z) > 0.5, by=id][
  , firstday := mn(day[consecutive]), by=id]
L
```
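Here is a sketch of that stricter version, reusing `L`, `z`, and `mn` from the previous chunk; the column names `consecutive2` and `firstday2` are made up for this illustration.

```{r}
# Also require the two qualifying days to be exactly one day apart
L[, consecutive2 := z > 0.5 & shift(z) > 0.5 & day == shift(day) + 1, by=id][
  , firstday2 := mn(day[consecutive2]), by=id]
L
```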
# Analysis

## Formatting

For analysis the sky is the limit. I take advantage of special formatting for model fit objects from the `rms` package by using `html` or `latex` methods and putting `results='asis'` in the chunk header to preserve the special formatting. Here is an example.

```{r lrm,results='asis'}
require(rms)
options(prType='html')   # needed to use special formatting (can use prType='latex')
set.seed(1)
n  <- 1000
w  <- data.table(x=runif(n), x2=rnorm(n), y=sample(0:1, n, replace=TRUE))
dd <- datadist(w); options(datadist='dd')   # rms needs this for summaries and plotting
f  <- lrm(y ~ x + rcs(x2, 4), data=w)
f
anova(f)
```

## Caching

The workhorse behind `Rmarkdown` and `Quarto` is [knitr](https://www.r-project.org/nosvn/pandoc/knitr.html), which processes the code chunks and properly mingles code with tabular and graphical output. `knitr` has a built-in caching mechanism so that code is not needlessly re-executed when its inputs have not changed. This easy-to-use process has two disadvantages: the dependencies are not transparent, and the stored cache files may be quite large. I like to take control of caching. To that end, the [runifChanged](https://github.com/harrelfe/rscripts/blob/master/hashCheck.r) function was written. Here is an example of its use. First, a function with no arguments must be composed. This is the (usually slow) function that will be conditionally run if any of a group of listed objects has changed since the last time it was run. When the function does need to be run, it produces an object that is stored in binary form in a user-specified file (the default file name is the name of the current R code chunk with `.rds` appended).

```{r runifch}
# Read the source code for the hashCheck and runifChanged functions from
# https://github.com/harrelfe/rscripts/blob/master/hashCheck.r
getRs('hashCheck.r', put='source')
g <- function() {
  # Fit a logistic regression model and bootstrap it 500 times, saving
  # the matrix of bootstrapped coefficients
  f <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE, data=dat)
  bootcov(f, B=500)
}
set.seed(3)
n   <- 2000
dat <- data.table(x1=runif(n), x2=runif(n),
                  y=sample(0:1, n, replace=TRUE))
# runifChanged will write runifch.rds if needed (chunk name.rds)
# g will be run if dat or the source code for lrm or bootcov changes
b <- runifChanged(g, dat, lrm, bootcov)
dim(b$boot.Coef)
head(b$boot.Coef)
```
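For comparison, the built-in `knitr` cache mentioned at the start of this section requires only a chunk option. The sketch below shows the chunk header (first line, normally wrapped in the usual backtick fence) followed by the chunk body; the label `slowfit` is made up, and `dependson` is how cross-chunk dependencies (here, on the `runifch` chunk above) are typically declared by hand.

```
{r slowfit, cache=TRUE, dependson='runifch'}
# knitr hashes this chunk's code and skips re-running it until the code changes
f <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE, data=dat)
bootcov(f, B=500)
```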
## Parallel Computing

The `runParallel` function makes it easy to use the available processor cores to speed up computations, especially simulations. By default it uses one fewer core than the number available. `runParallel` makes the `parallel` package easier to use and recombines the results from the per-core batches. The user writes a function that does the work on one core, and the same function is run on all cores. This function has set arguments and must return a named list. A base random number `seed` is given, and the seed is set to this plus `i` for core number `i`. The total number of repetitions is given, and the most balanced possible number of repetitions is run on each core so that the per-core counts sum to the total desired number of iterations. `runifChanged` is again used, to avoid running the simulations if no inputs have changed.

```{r rflowpar}
# Load the runParallel function from github
getRs('runParallel.r', put='source')
# Function to do simulations on one core
run1 <- function(reps, showprogress, core) {
  cof <- matrix(NA, nrow=reps, ncol=3,
                dimnames=list(NULL, .q(a, b1, b2)))
  for(i in 1 : reps) {
    y <- sample(0:1, n, replace=TRUE)
    f <- lrm(y ~ X)
    cof[i, ] <- coef(f)
  }
  list(coef=cof)
}
# Debug one core run, with only 3 iterations
n    <- 300
seed <- 3
set.seed(seed)
X <- cbind(x1=runif(n), x2=runif(n))   # condition on covariates
run1(3)
nsim  <- 5000
g     <- function() runParallel(run1, reps=nsim, seed=seed)
Coefs <- runifChanged(g, X, run1, nsim, seed)
dim(Coefs)
apply(Coefs, 2, mean)
```

# Computing Environment

The following output is created by the command `markupSpecs$html$session()`, where `markupSpecs` is defined in the `Hmisc` package.

`r markupSpecs$html$session()`